# Grobid

GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents.

It is designed and expected to be used to parse academic papers, where it works particularly well.

*Note*: if the articles supplied to Grobid are large documents (e.g. dissertations) exceeding a certain number
of elements, they might not be processed.

This page covers how to use the Grobid to parse articles for LangChain.

## Installation
The grobid installation is described in details in https://grobid.readthedocs.io/en/latest/Install-Grobid/.
However, it is probably easier and less troublesome to run grobid through a docker container,
as documented [here](https://grobid.readthedocs.io/en/latest/Grobid-docker/).

## Use Grobid with LangChain

Once grobid is installed and up and running (you can check by accessing it http://localhost:8070),
you're ready to go.

You can now use the GrobidParser to produce documents
```python
from langchain.document_loaders.parsers import GrobidParser
from langchain.document_loaders.generic import GenericLoader

#Produce chunks from article paragraphs
loader = GenericLoader.from_filesystem(
    "/Users/31treehaus/Desktop/Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser= GrobidParser(segment_sentences=False)
)
docs = loader.load()

#Produce chunks from article sentences
loader = GenericLoader.from_filesystem(
    "/Users/31treehaus/Desktop/Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser= GrobidParser(segment_sentences=True)
)
docs = loader.load()
```
Chunk metadata will include Bounding Boxes. Although these are a bit funky to parse,
they are explained in https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/